Text Data Analysis (YouTube Case Study)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Sentiment Analysis

In [2]:
# on_bad_lines replaces the deprecated error_bad_lines argument; 'warn' skips malformed rows and reports them
comments = pd.read_csv(r'C:\DA_BA_material/UScomments.csv' , on_bad_lines='warn')
Skipping line 41589: expected 4 fields, saw 11
Skipping line 51628: expected 4 fields, saw 7
Skipping line 114465: expected 4 fields, saw 5
Skipping line 142496: expected 4 fields, saw 8
Skipping line 189732: expected 4 fields, saw 6
Skipping line 245218: expected 4 fields, saw 7
Skipping line 388430: expected 4 fields, saw 5
DtypeWarning: Columns (2,3) have mixed types. Specify dtype option on import or set low_memory=False.
In [3]:
comments.head()
Out[3]:
video_id comment_text likes replies
0 XpVt6Z1Gjjo Logan Paul it's yo big day ‼️‼️‼️ 4 0
1 XpVt6Z1Gjjo I've been following you from the start of your... 3 0
2 XpVt6Z1Gjjo Say hi to Kong and maverick for me 3 0
3 XpVt6Z1Gjjo MY FAN . attendance 3 0
4 XpVt6Z1Gjjo trending 😉 3 0
In [4]:
comments.isnull()
Out[4]:
video_id comment_text likes replies
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
... ... ... ... ...
691395 False False False False
691396 False False False False
691397 False False False False
691398 False False False False
691399 False False False False

691400 rows × 4 columns

In [5]:
comments.isnull().sum()
Out[5]:
video_id         0
comment_text    25
likes            0
replies          0
dtype: int64
In [6]:
comments.dropna(inplace=True)
In [7]:
comments.isnull().sum()
Out[7]:
video_id        0
comment_text    0
likes           0
replies         0
dtype: int64
In [8]:
!pip install textblob
Requirement already satisfied: textblob in c:\users\benny\anaconda3\lib\site-packages (0.15.3)
Requirement already satisfied: nltk>=3.1 in c:\users\benny\anaconda3\lib\site-packages (from textblob) (3.8.1)
Requirement already satisfied: click in c:\users\benny\anaconda3\lib\site-packages (from nltk>=3.1->textblob) (8.0.4)
Requirement already satisfied: joblib in c:\users\benny\anaconda3\lib\site-packages (from nltk>=3.1->textblob) (1.2.0)
Requirement already satisfied: regex>=2021.8.3 in c:\users\benny\anaconda3\lib\site-packages (from nltk>=3.1->textblob) (2022.7.9)
Requirement already satisfied: tqdm in c:\users\benny\anaconda3\lib\site-packages (from nltk>=3.1->textblob) (4.65.0)
Requirement already satisfied: colorama in c:\users\benny\anaconda3\lib\site-packages (from click->nltk>=3.1->textblob) (0.4.6)
In [9]:
from textblob import TextBlob
In [10]:
comments.head(6)
Out[10]:
video_id comment_text likes replies
0 XpVt6Z1Gjjo Logan Paul it's yo big day ‼️‼️‼️ 4 0
1 XpVt6Z1Gjjo I've been following you from the start of your... 3 0
2 XpVt6Z1Gjjo Say hi to Kong and maverick for me 3 0
3 XpVt6Z1Gjjo MY FAN . attendance 3 0
4 XpVt6Z1Gjjo trending 😉 3 0
5 XpVt6Z1Gjjo #1 on trending AYYEEEEE 3 0
In [11]:
TextBlob("Logan Paul it's yo big day ‼️‼️‼️").sentiment.polarity
Out[11]:
0.0
In [12]:
comments.shape
Out[12]:
(691375, 4)
In [13]:
sample_df = comments[0:1000]
In [14]:
sample_df.shape
Out[14]:
(1000, 4)
In [15]:
polarity = []

# Score every comment; fall back to a neutral 0 if TextBlob cannot process the text
for comment in comments['comment_text']:
    try:
        polarity.append(TextBlob(comment).sentiment.polarity)
    except Exception:
        polarity.append(0)
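The fallback in the loop above can also be written as a small helper. This is a minimal sketch, not the notebook's own code: `safe_polarity` and `toy_scorer` are hypothetical names, and `toy_scorer` merely stands in for `lambda text: TextBlob(text).sentiment.polarity` so that the example runs without TextBlob installed.

```python
def safe_polarity(text, scorer, default=0.0):
    """Return scorer(text), or `default` when the text is missing or cannot be scored."""
    # Missing values read by pandas arrive as float NaN, not as strings.
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return scorer(text)
    except Exception:
        # Any scoring failure is treated as neutral, as in the loop above.
        return default


def toy_scorer(text):
    # Hypothetical stand-in for a real sentiment model.
    return 1.0 if 'best' in text.lower() else 0.0


scores = [safe_polarity(t, toy_scorer) for t in ['yu are the best', float('nan'), '']]
print(scores)
```

Checking the type first keeps genuine scoring errors separate from missing data, which a blanket fallback would hide.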
In [16]:
len(polarity)
Out[16]:
691375
In [17]:
comments['polarity'] = polarity
In [18]:
comments.head(5)
Out[18]:
video_id comment_text likes replies polarity
0 XpVt6Z1Gjjo Logan Paul it's yo big day ‼️‼️‼️ 4 0 0.0
1 XpVt6Z1Gjjo I've been following you from the start of your... 3 0 0.0
2 XpVt6Z1Gjjo Say hi to Kong and maverick for me 3 0 0.0
3 XpVt6Z1Gjjo MY FAN . attendance 3 0 0.0
4 XpVt6Z1Gjjo trending 😉 3 0 0.0

WordCloud Analysis

A word cloud is a graphical representation of word frequency: the more often a word occurs in the text, the larger it is drawn.
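The counting behind a word cloud can be sketched with the standard library. This is only an illustration under simple assumptions: `word_frequencies` is a hypothetical helper, the tokenisation is a plain regular expression, and the WordCloud package's own tokenisation differs in detail.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords):
    """Count lower-cased words in `text`, ignoring anything listed in `stopwords`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


demo_counts = word_frequencies('This is the best, the best video', {'this', 'is', 'the'})
print(demo_counts.most_common())
```

Words with the highest counts are the ones a word cloud would draw largest, which is why removing stopwords first matters.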

In [19]:
filter1 = comments['polarity']==1
In [20]:
comments_positive = comments[filter1]
In [21]:
filter2 = comments['polarity']==-1
In [22]:
comments_negative = comments[filter2]
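The two filters keep only comments scored exactly 1 or exactly -1, the extremes of the polarity range. A looser split into positive, neutral and negative groups could look like the sketch below; `label_polarity` and its `threshold` parameter are hypothetical and not part of the notebook.

```python
def label_polarity(score, threshold=0.0):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'


demo_labels = [label_polarity(s) for s in [1.0, 0.0, -1.0, 0.35]]
print(demo_labels)
```

Raising `threshold` widens the neutral band, which may help when many comments score close to zero.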
In [23]:
comments_positive.head(5)
Out[23]:
video_id comment_text likes replies polarity
64 XpVt6Z1Gjjo yu are the best 1 0 1.0
156 cLdxuaxaQwc Power is the disease.  Care is the cure.  Keep... 0 0 1.0
227 WYYvHb03Eog YAS Can't wait to get it! I just need to sell ... 0 0 1.0
307 sjlHnJvXdQs This is priceless 0 0 1.0
319 sjlHnJvXdQs Summed up perfectly 0 0 1.0
In [24]:
!pip install wordcloud
Requirement already satisfied: wordcloud in c:\users\benny\anaconda3\lib\site-packages (1.9.2)
Requirement already satisfied: numpy>=1.6.1 in c:\users\benny\anaconda3\lib\site-packages (from wordcloud) (1.24.3)
Requirement already satisfied: pillow in c:\users\benny\anaconda3\lib\site-packages (from wordcloud) (9.4.0)
Requirement already satisfied: matplotlib in c:\users\benny\anaconda3\lib\site-packages (from wordcloud) (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.0.5)
Requirement already satisfied: cycler>=0.10 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (23.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\benny\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\benny\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
In [25]:
from wordcloud import WordCloud , STOPWORDS
In [26]:
set(STOPWORDS)
Out[26]:
{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'also',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 "can't",
 'cannot',
 'com',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'doing',
 "don't",
 'down',
 'during',
 'each',
 'else',
 'ever',
 'few',
 'for',
 'from',
 'further',
 'get',
 'had',
 "hadn't",
 'has',
 "hasn't",
 'have',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'hence',
 'her',
 'here',
 "here's",
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 "how's",
 'however',
 'http',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'k',
 "let's",
 'like',
 'me',
 'more',
 'most',
 "mustn't",
 'my',
 'myself',
 'no',
 'nor',
 'not',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'otherwise',
 'ought',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r',
 'same',
 'shall',
 "shan't",
 'she',
 "she'd",
 "she'll",
 "she's",
 'should',
 "shouldn't",
 'since',
 'so',
 'some',
 'such',
 'than',
 'that',
 "that's",
 'the',
 'their',
 'theirs',
 'them',
 'themselves',
 'then',
 'there',
 "there's",
 'therefore',
 'these',
 'they',
 "they'd",
 "they'll",
 "they're",
 "they've",
 'this',
 'those',
 'through',
 'to',
 'too',
 'under',
 'until',
 'up',
 'very',
 'was',
 "wasn't",
 'we',
 "we'd",
 "we'll",
 "we're",
 "we've",
 'were',
 "weren't",
 'what',
 "what's",
 'when',
 "when's",
 'where',
 "where's",
 'which',
 'while',
 'who',
 "who's",
 'whom',
 'why',
 "why's",
 'with',
 "won't",
 'would',
 "wouldn't",
 'www',
 'you',
 "you'd",
 "you'll",
 "you're",
 "you've",
 'your',
 'yours',
 'yourself',
 'yourselves'}
In [27]:
comments['comment_text']
Out[27]:
0                         Logan Paul it's yo big day ‼️‼️‼️
1         I've been following you from the start of your...
2                        Say hi to Kong and maverick for me
3                                       MY FAN . attendance
4                                                trending 😉
                                ...                        
691395                                               Лучшая
691396    qu'est ce que j'aimerais que tu viennes à Roan...
691397                            Ven a mexico! 😍 te amo LP
691398                                      Islığı yeter...
691399    Kocham tą piosenkę😍❤❤❤byłam zakochana po uszy ...
Name: comment_text, Length: 691375, dtype: object
In [28]:
type(comments['comment_text'])
Out[28]:
pandas.core.series.Series
In [29]:
total_comments_positive = ' '.join(comments_positive['comment_text'])
In [30]:
wordcloud = WordCloud(stopwords=set(STOPWORDS)).generate(total_comments_positive)
In [31]:
plt.imshow(wordcloud)
plt.axis('off')
Out[31]:
(-0.5, 399.5, 199.5, -0.5)
In [32]:
total_comments_negative = ' '.join(comments_negative['comment_text'])
In [33]:
wordcloud2 = WordCloud(stopwords=set(STOPWORDS)).generate(total_comments_negative)
In [34]:
plt.imshow(wordcloud2)
plt.axis('off')
Out[34]:
(-0.5, 399.5, 199.5, -0.5)

Emoji Analysis

In [35]:
!pip install emoji==2.2.0
Requirement already satisfied: emoji==2.2.0 in c:\users\benny\anaconda3\lib\site-packages (2.2.0)
In [36]:
import emoji
In [37]:
emoji.__version__
Out[37]:
'2.2.0'
In [38]:
comments['comment_text'].head(6)
Out[38]:
0                    Logan Paul it's yo big day ‼️‼️‼️
1    I've been following you from the start of your...
2                   Say hi to Kong and maverick for me
3                                  MY FAN . attendance
4                                           trending 😉
5                              #1 on trending AYYEEEEE
Name: comment_text, dtype: object
In [39]:
comment = 'trending 😉'
In [40]:
[char for char in comment if char in emoji.EMOJI_DATA]
Out[40]:
['😉']
In [41]:
emoji_list = []
for char in comment:
    if char in emoji.EMOJI_DATA:
        emoji_list.append(char)
In [42]:
emoji_list
Out[42]:
['😉']
In [43]:
all_emojis_list = []

for comment in comments['comment_text'].dropna():
    for char in comment:
        if char in emoji.EMOJI_DATA:
            all_emojis_list.append(char)
In [44]:
all_emojis_list[0:10]
Out[44]:
['‼', '‼', '‼', '😉', '😭', '👍', '🏻', '❤', '😍', '💋']
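The list above contains '🏻' as an entry of its own because the loop tests one code point at a time, while some emojis are sequences of several code points (a base emoji plus a skin-tone modifier, for example). The standard-library sketch below shows the effect; the variable names are illustrative only.

```python
# Thumbs-up followed by a light skin-tone modifier: one visible emoji, two code points.
thumbs_up_light = '\U0001F44D\U0001F3FB'

# Iterating over the string yields each code point separately, as the loop above does.
code_points = [f'U+{ord(ch):04X}' for ch in thumbs_up_light]
print(len(thumbs_up_light), code_points)
```

If whole sequences should be counted as single emojis, the emoji package's `emoji.emoji_list` function may be a better fit than a per-character test; that suggestion is not taken from the notebook.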
In [45]:
from collections import Counter
In [46]:
Counter(all_emojis_list).most_common(10)
Out[46]:
[('😂', 36987),
 ('😍', 33453),
 ('❤', 31119),
 ('🔥', 8694),
 ('😭', 8398),
 ('👏', 5719),
 ('😘', 5545),
 ('👍', 5476),
 ('💖', 5359),
 ('💕', 5147)]
In [47]:
Counter(all_emojis_list).most_common(10)[0][0]
Out[47]:
'😂'
In [48]:
Counter(all_emojis_list).most_common(10)[1][0]
Out[48]:
'😍'
In [49]:
Counter(all_emojis_list).most_common(10)[2][0]
Out[49]:
'❤'
In [50]:
emojis = [emo for emo , _ in Counter(all_emojis_list).most_common(10)]
In [51]:
Counter(all_emojis_list).most_common(10)[0][1]
Out[51]:
36987
In [52]:
Counter(all_emojis_list).most_common(10)[1][1]
Out[52]:
33453
In [53]:
Counter(all_emojis_list).most_common(10)[2][1]
Out[53]:
31119
In [54]:
freqs = [count for _ , count in Counter(all_emojis_list).most_common(10)]
In [55]:
freqs
Out[55]:
[36987, 33453, 31119, 8694, 8398, 5719, 5545, 5476, 5359, 5147]
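Both lists can also be produced from a single `most_common` call by unzipping the (emoji, count) pairs. This is a small sketch on made-up data; `sample_emojis`, `top_labels` and `top_counts` are illustrative names.

```python
from collections import Counter

# Made-up sample; in the notebook this would be all_emojis_list.
sample_emojis = ['😂', '😍', '😂', '❤', '😂', '😍']

# most_common returns (item, count) pairs sorted by count, highest first.
top_pairs = Counter(sample_emojis).most_common(2)

# zip(*pairs) separates the pairs into one tuple of items and one tuple of counts.
top_labels, top_counts = zip(*top_pairs)
print(top_labels, top_counts)
```

Counting once avoids rebuilding the `Counter` for every element, which adds up on a list with hundreds of thousands of entries.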
In [56]:
import plotly.graph_objs as go
from plotly.offline import iplot
In [57]:
trace = go.Bar(x=emojis , y=freqs)
In [58]:
iplot([trace])

Data Collection

In [59]:
import os
In [60]:
files = os.listdir(r'C:\DA_BA_material\additional_data')
In [61]:
files
Out[61]:
['CAvideos.csv',
 'CA_category_id.json',
 'DEvideos.csv',
 'DE_category_id.json',
 'FRvideos.csv',
 'FR_category_id.json',
 'GBvideos.csv',
 'GB_category_id.json',
 'INvideos.csv',
 'IN_category_id.json',
 'JPvideos.csv',
 'JP_category_id.json',
 'KRvideos.csv',
 'KR_category_id.json',
 'MXvideos.csv',
 'MX_category_id.json',
 'RUvideos.csv',
 'RU_category_id.json',
 'USvideos.csv',
 'US_category_id.json']
In [62]:
files_csv = [file for file in files if '.csv' in file]
In [63]:
files_csv
Out[63]:
['CAvideos.csv',
 'DEvideos.csv',
 'FRvideos.csv',
 'GBvideos.csv',
 'INvideos.csv',
 'JPvideos.csv',
 'KRvideos.csv',
 'MXvideos.csv',
 'RUvideos.csv',
 'USvideos.csv']
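The substring test `'.csv' in file` works for this folder, but it would also match a name that merely contains ".csv" somewhere. A stricter variant compares the file extension, as in this sketch; `csv_names` is a hypothetical helper and `notes.csv.bak` is a made-up file name.

```python
from pathlib import Path


def csv_names(names):
    """Keep only names whose final extension is .csv (case-insensitive)."""
    return sorted(name for name in names if Path(name).suffix.lower() == '.csv')


demo_files = ['USvideos.csv', 'US_category_id.json', 'notes.csv.bak']
print(csv_names(demo_files))
```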
In [64]:
import warnings
warnings.filterwarnings('ignore')
In [65]:
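path = r'C:\DA_BA_material\additional_data'

# Read each country file, then concatenate once instead of growing full_df inside the loop
frames = []
for file in files_csv:
    # on_bad_lines='skip' replaces the deprecated error_bad_lines=False
    current_df = pd.read_csv(os.path.join(path , file) , encoding='iso-8859-1' , on_bad_lines='skip')
    frames.append(current_df)

full_df = pd.concat(frames , ignore_index=True)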
In [66]:
full_df.shape
Out[66]:
(375942, 16)

Exporting data to CSV, JSON, and databases

In [67]:
full_df[full_df.duplicated()].shape
Out[67]:
(36417, 16)
In [68]:
full_df = full_df.drop_duplicates()
In [69]:
full_df.shape
Out[69]:
(339525, 16)
In [70]:
full_df[0:1000].to_csv(r'C:\DA_BA_material/youtube_sample.csv' , index=False)
In [71]:
full_df[0:1000].to_json(r'C:\DA_BA_material/youtube_sample.json')
In [72]:
from sqlalchemy import create_engine
In [73]:
engine = create_engine(r'sqlite:///C:\DA_BA_material/youtube_sample.sqlite')
In [74]:
full_df[0:1000].to_sql('Users' , con=engine , if_exists='append')
Out[74]:
1000
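Because the export uses `if_exists='append'`, re-running the cell adds the same rows to the table again. The sketch below illustrates the difference between `'append'` and `'replace'`; it assumes an in-memory SQLite database and a made-up three-row frame in place of the notebook's file and data.

```python
import sqlite3

import pandas as pd

# Made-up frame standing in for full_df[0:1000].
demo = pd.DataFrame({'video_id': ['a', 'b', 'c'], 'likes': [1, 2, 3]})
conn = sqlite3.connect(':memory:')

# Two appends, as if the export cell had been run twice.
demo.to_sql('Users', con=conn, if_exists='append', index=False)
demo.to_sql('Users', con=conn, if_exists='append', index=False)
rows_after_append = int(pd.read_sql_query('SELECT COUNT(*) AS n FROM Users', conn)['n'].iloc[0])

# 'replace' drops the existing table and writes it afresh.
demo.to_sql('Users', con=conn, if_exists='replace', index=False)
rows_after_replace = int(pd.read_sql_query('SELECT COUNT(*) AS n FROM Users', conn)['n'].iloc[0])

conn.close()
print(rows_after_append, rows_after_replace)
```

Reading the row count back is a quick way to confirm that the table holds what was intended.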

Analysing the most liked category

In [75]:
full_df.head(5)
Out[75]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé ...
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las...
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► ...
3 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho...
4 2Vv-BfVoq4g 17.14.11 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran 10 2017-11-09T11:04:14.000Z edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... 33523622 1634130 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg False False False 🎧: https://ad.gt/yt-perfect\n💰: https://...
In [76]:
full_df['category_id'].unique()
Out[76]:
array([10, 23, 24, 25, 22, 26,  1, 28, 20, 17, 29, 15, 19,  2, 27, 43, 30,
       44], dtype=int64)
In [77]:
json_df = pd.read_json(r'C:\DA_BA_material\additional_data/US_category_id.json')
In [78]:
json_df
Out[78]:
kind etag items
0 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
1 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
2 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
3 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
4 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
5 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
6 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
7 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
8 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
9 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
10 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
11 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
12 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
13 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
14 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
15 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
16 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
17 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
18 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
19 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
20 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
21 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
22 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
23 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
24 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
25 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
26 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
27 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
28 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
29 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
30 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
31 youtube#videoCategoryListResponse "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv... {'kind': 'youtube#videoCategory', 'etag': '"m2...
In [79]:
json_df['items'][0]
Out[79]:
{'kind': 'youtube#videoCategory',
 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ"',
 'id': '1',
 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ',
  'title': 'Film & Animation',
  'assignable': True}}
In [80]:
json_df['items'][1]
Out[80]:
{'kind': 'youtube#videoCategory',
 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA"',
 'id': '2',
 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ',
  'title': 'Autos & Vehicles',
  'assignable': True}}
In [81]:
cat_dict = {}

for item in json_df['items'].values:
    cat_dict[int(item['id'])] = item['snippet']['title']
In [82]:
cat_dict
Out[82]:
{1: 'Film & Animation',
 2: 'Autos & Vehicles',
 10: 'Music',
 15: 'Pets & Animals',
 17: 'Sports',
 18: 'Short Movies',
 19: 'Travel & Events',
 20: 'Gaming',
 21: 'Videoblogging',
 22: 'People & Blogs',
 23: 'Comedy',
 24: 'Entertainment',
 25: 'News & Politics',
 26: 'Howto & Style',
 27: 'Education',
 28: 'Science & Technology',
 29: 'Nonprofits & Activism',
 30: 'Movies',
 31: 'Anime/Animation',
 32: 'Action/Adventure',
 33: 'Classics',
 34: 'Comedy',
 35: 'Documentary',
 36: 'Drama',
 37: 'Family',
 38: 'Foreign',
 39: 'Horror',
 40: 'Sci-Fi/Fantasy',
 41: 'Thriller',
 42: 'Shorts',
 43: 'Shows',
 44: 'Trailers'}
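The same id-to-title lookup can be built directly from the JSON structure with a dict comprehension. The sketch below uses a shortened, hand-written JSON string that mirrors the `items` layout shown above; `raw_json` and `category_names` are illustrative names.

```python
import json

# Shortened stand-in for the contents of a category file (only the fields used here).
raw_json = '''
{"items": [
  {"id": "1",  "snippet": {"title": "Film & Animation"}},
  {"id": "10", "snippet": {"title": "Music"}}
]}
'''

# The ids are stored as strings, so they are converted to int to match category_id.
category_names = {
    int(item['id']): item['snippet']['title']
    for item in json.loads(raw_json)['items']
}
print(category_names)
```

When such a dictionary is passed to `Series.map`, any `category_id` without an entry becomes NaN, so checking for missing `category_name` values afterwards is a sensible sanity check.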
In [83]:
full_df['category_name'] = full_df['category_id'].map(cat_dict)
In [84]:
full_df.head(4)
Out[84]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description category_name
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé ... Music
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las... Comedy
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► ... Comedy
3 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho... Entertainment

Analysing whether the audience is engaged or not

In [85]:
plt.figure(figsize=(12,8))
sns.boxplot(x='category_name' , y='likes' , data=full_df)
plt.xticks(rotation='vertical')
Out[85]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17]),
 [Text(0, 0, 'Music'),
  Text(1, 0, 'Comedy'),
  Text(2, 0, 'Entertainment'),
  Text(3, 0, 'News & Politics'),
  Text(4, 0, 'People & Blogs'),
  Text(5, 0, 'Howto & Style'),
  Text(6, 0, 'Film & Animation'),
  Text(7, 0, 'Science & Technology'),
  Text(8, 0, 'Gaming'),
  Text(9, 0, 'Sports'),
  Text(10, 0, 'Nonprofits & Activism'),
  Text(11, 0, 'Pets & Animals'),
  Text(12, 0, 'Travel & Events'),
  Text(13, 0, 'Autos & Vehicles'),
  Text(14, 0, 'Education'),
  Text(15, 0, 'Shows'),
  Text(16, 0, 'Movies'),
  Text(17, 0, 'Trailers')])
In [86]:
full_df['like_rate'] = (full_df['likes']/full_df['views'])*100
full_df['dislike_rate']= (full_df['dislikes']/full_df['views'])*100
full_df['comment_count_rate'] = (full_df['comment_count']/full_df['views'])*100
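Each rate is the count divided by the number of views, expressed as a percentage. The sketch below works through the first row of the data (787425 likes on 17158579 views). `rate` is a hypothetical helper, and returning NaN for zero views is an assumption made here, not something the notebook does.

```python
import math


def rate(count, views):
    """Return count as a percentage of views, or NaN when there are no views."""
    if not views:
        return math.nan
    return count / views * 100


first_row_like_rate = rate(787425, 17158579)
print(round(first_row_like_rate, 6))
```

In pandas, dividing by a zero in `views` yields `inf` or NaN rather than an error, so such rows may need to be filtered out before plotting.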
In [87]:
full_df.columns
Out[87]:
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description', 'category_name', 'like_rate',
       'dislike_rate', 'comment_count_rate'],
      dtype='object')
In [88]:
plt.figure(figsize=(8,6))
sns.boxplot(x='category_name' , y='like_rate' , data=full_df)
plt.xticks(rotation='vertical')
plt.show()
In [89]:
sns.regplot(x='views' , y='likes' , data = full_df)
Out[89]:
<Axes: xlabel='views', ylabel='likes'>
In [90]:
full_df.columns
Out[90]:
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description', 'category_name', 'like_rate',
       'dislike_rate', 'comment_count_rate'],
      dtype='object')
In [91]:
full_df[['views', 'likes', 'dislikes']].corr()
Out[91]:
views likes dislikes
views 1.000000 0.779531 0.405428
likes 0.779531 1.000000 0.451809
dislikes 0.405428 0.451809 1.000000
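`DataFrame.corr()` computes the Pearson correlation coefficient by default: the covariance of two columns divided by the product of their standard deviations. A plain-Python sketch of that formula follows; `pearson` is a hypothetical helper and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))   # perfectly linear, decreasing
```

Read this way, the 0.78 between views and likes indicates a strong positive linear relationship, while the 0.41 between views and dislikes is noticeably weaker.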
In [92]:
sns.heatmap(full_df[['views', 'likes', 'dislikes']].corr() , annot=True)
Out[92]:
<Axes: >

Analysing trending videos on YouTube

In [93]:
full_df.head(6)
Out[93]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description category_name like_rate dislike_rate comment_count_rate
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé ... Music 4.589104 0.253051 0.733639
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las... Comedy 12.594873 0.166363 1.284185
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► ... Comedy 4.575843 0.167292 0.256342
3 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho... Entertainment 6.309630 0.094903 0.835851
4 2Vv-BfVoq4g 17.14.11 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran 10 2017-11-09T11:04:14.000Z edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... 33523622 1634130 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg False False False 🎧: https://ad.gt/yt-perfect\n💰: https://... Music 4.874563 0.062887 0.253752
5 0yIWz1XEeyc 17.14.11 Jake Paul Says Alissa Violet CHEATED with LOGA... DramaAlert 25 2017-11-13T07:37:51.000Z #DramaAlert|"Drama"|"Alert"|"DramaAlert"|"keem... 1309699 103755 4613 12143 https://i.ytimg.com/vi/0yIWz1XEeyc/default.jpg False False False ► Follow for News! - https://twitter.com/KEE... News & Politics 7.922049 0.352218 0.927160
In [94]:
full_df['channel_title'].value_counts()
Out[94]:
The Late Show with Stephen Colbert    710
WWE                                   643
Late Night with Seth Meyers           592
TheEllenShow                          555
Jimmy Kimmel Live                     528
                                     ... 
Daas                                    1
YT Industries                           1
BTLV Le média complémentaire          1
Quem Sabia ?                            1
Jessi Osorno                            1
Name: channel_title, Length: 37824, dtype: int64
In [95]:
cdf = full_df.groupby(['channel_title']).size().sort_values(ascending=False).reset_index()
In [96]:
cdf = cdf.rename(columns={0:'total_videos'})
In [97]:
cdf 
Out[97]:
channel_title total_videos
0 The Late Show with Stephen Colbert 710
1 WWE 643
2 Late Night with Seth Meyers 592
3 TheEllenShow 555
4 Jimmy Kimmel Live 528
... ... ...
37819 Kd Malts 1
37820 Zedan TV 1
37821 Kc Kelly - Rocketprenuer 1
37822 Kbaby 1
37823 Pavel Sidorik TV 1

37824 rows × 2 columns

In [98]:
import plotly.express as px
In [99]:
px.bar(data_frame=cdf[0:20] , x='channel_title' , y='total_videos')

Does punctuation have an impact on views, likes, and dislikes?

In [101]:
full_df['title'][0]
Out[101]:
'Eminem - Walk On Water (Audio) ft. Beyoncé'
In [102]:
import string
In [103]:
string.punctuation
Out[103]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [105]:
len([char for char in full_df['title'][0] if char in string.punctuation])
Out[105]:
4
In [112]:
def punc_count(text):
    return len([char for char in text if char in string.punctuation])
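An equivalent counter can be checked against the title shown earlier, which contains four punctuation characters. This is a sketch: `count_punctuation` and `PUNCTUATION` are illustrative names, and the set lookup is only a small efficiency choice.

```python
import string

# A frozenset gives constant-time membership tests.
PUNCTUATION = frozenset(string.punctuation)


def count_punctuation(text):
    """Count the ASCII punctuation characters in `text`."""
    return sum(ch in PUNCTUATION for ch in text)


title = 'Eminem - Walk On Water (Audio) ft. Beyoncé'
print(count_punctuation(title))
```

Only the ASCII characters in `string.punctuation` are counted, so typographic quotes and other Unicode punctuation in titles are ignored.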
In [113]:
sample = full_df[0:10000].copy()
In [114]:
sample['count_punc'] = sample['title'].apply(punc_count)
In [115]:
sample['count_punc']
Out[115]:
0       4
1       1
2       3
3       3
4       3
       ..
9995    6
9996    0
9997    1
9998    0
9999    6
Name: count_punc, Length: 10000, dtype: int64
In [117]:
plt.figure(figsize=(8,6))
sns.boxplot(x='count_punc' , y='likes' , data=sample)
plt.show()